2019-02-20

Goals for today

session 0

  • reminders
    • about the tidyverse
    • tidy data
    • R base
  • prepare for next week
    • case study on the gapminder dataset
    • interactive session

Tidyverse workflow

David Robinson

Tidyverse, packages in processes

https://www.tidyverse.org/

Tidyverse components

core / extended

Core

  • ggplot2, for data visualization
  • dplyr, for data manipulation
  • tidyr, for data tidying
  • readr, for data import
  • purrr, for functional programming
  • tibble, for tibbles, a modern re-imagining of data frames

source: http://tidyverse.tidyverse.org/. H.Wickham

Extended

  • Working with specific types of vectors:
    • hms, for times
    • stringr, for strings
    • lubridate, for date/times
    • forcats, for factors
  • Importing other types of data:
    • feather, for sharing data
    • haven, for SPSS, SAS and Stata files
    • httr, for web apis
    • jsonlite for JSON
    • readxl, for .xls and .xlsx files
    • rvest, for web scraping
    • xml2, for XML files
  • Modelling
    • modelr, for modelling within a pipeline
    • broom, for models -> tidy data

Tidyverse criticism

dialect

Tidyverse criticism

controversy

Tidyverse criticism

jobs

Personal complains

  • still young so change quickly and drastically. Backward compatibility is not always maintained.
  • tibbles are nice but a lot of non-tidyverse packages require matrices. rownames still an issue.

Anyway, learning the tidyverse does not prevent to learn R base, it helps to get things done early in the process

Tidyverse

trends

Tidyverse

trends

Pipes with magrittr

developed by Stefan Milton Bache

compare the approaches

  • classic parenthesis syntax
  • magrittr pipeline

R base

set.seed(12)
round(mean(rnorm(5)), 2)
[1] -0.76

magrittr

set.seed(12)
rnorm(5) %>%
  mean() %>%
  round(2)
[1] -0.76

Of note, magrittr needs to loaded with either:

library(magrittr)
library(dplyr)
library(tidyverse)

tidy data

Learning objectives

Definitions

  • Principles of tidy data to structure data
  • Find errors in existing data sets
  • Structure data
  • Reshaping data with tidyr

Comments

  • Cleaning data also requires dplyr
  • tidyr and dplyr are intertwined
  • Focus on “tidy data”
  • Introduction of tidyr ways

Basic: everything in the data rectangle

with header

Rationale

All happy families are alike; each unhappy family is unhappy in its own way. Leo Tolstoy, Anna Karenina

Semantics

Definitions

  • Variable: A quantity, quality, or property that you can measure.
  • Observation: A set of values that display the relationship between variables. To be an observation, values need to be measured under similar conditions, usually measured on the same observational unit at the same time.
  • Value: The state of a variable that you observe when you measure it.

source: Garret Grolemund and vignette("tidy-data")

Definition

Tidy data

  1. Each variable is in its own column
  2. Each observation is in its own row
  3. Each value is in its own cell

Convert Long / wide format

The wide format is generally untidy but found in the majority of datasets

Animation

by Garrick Aden-Buie

Demo with the iris dataset

gather

head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

untidy!

4 measurements in columns

library("tidyverse")
iris_melt <- iris %>%
  rownames_to_column(var = "number") %>%
  as_tibble() %>%
  gather(flower, measure, contains("al"))
iris_melt
# A tibble: 600 x 4
   number Species flower       measure
   <chr>  <fct>   <chr>          <dbl>
 1 1      setosa  Sepal.Length     5.1
 2 2      setosa  Sepal.Length     4.9
 3 3      setosa  Sepal.Length     4.7
 4 4      setosa  Sepal.Length     4.6
 5 5      setosa  Sepal.Length     5  
 6 6      setosa  Sepal.Length     5.4
 7 7      setosa  Sepal.Length     4.6
 8 8      setosa  Sepal.Length     5  
 9 9      setosa  Sepal.Length     4.4
10 10     setosa  Sepal.Length     4.9
# … with 590 more rows

tidy

Each variable in its own column

Demo with the iris dataset

spread

iris_melt %>%
  spread(flower, measure)
# A tibble: 150 x 6
   number Species    Petal.Length Petal.Width Sepal.Length Sepal.Width
   <chr>  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
 1 1      setosa              1.4         0.2          5.1         3.5
 2 10     setosa              1.5         0.1          4.9         3.1
 3 100    versicolor          4.1         1.3          5.7         2.8
 4 101    virginica           6           2.5          6.3         3.3
 5 102    virginica           5.1         1.9          5.8         2.7
 6 103    virginica           5.9         2.1          7.1         3  
 7 104    virginica           5.6         1.8          6.3         2.9
 8 105    virginica           5.8         2.2          6.5         3  
 9 106    virginica           6.6         2.1          7.6         3  
10 107    virginica           4.5         1.7          4.9         2.5
# … with 140 more rows

Why tidy is useful?

iris %>%
  gather(flower, measure, -Species) %>%
  ggplot() +
  geom_boxplot(aes(x = Species, y = measure, fill = flower))

Functional programming

Learning objectives

You will learn to:

  • use a functional programming approach to simplify your code
  • pass functions as arguments to higher order functions
  • use purrr::map to replace for loops
  • use purrr::map together with nested tibbles

Reminder

vector & list

Atomic vectors

Each atomic vector contains only a single type of data

  • The type of each atom is the same
  • The size of each atom is 1 (single element)

Example

# Logical
c(TRUE, FALSE, TRUE)
[1]  TRUE FALSE  TRUE
# double
c(1, 5, 7)
[1] 1 5 7
# character
c("character", "sequence")
[1] "character" "sequence" 

Reminder

vector & list

Coercion

  • Is the conversion between types
  • Can be
    • explicit (using as.*() functions)
    • implicit

Explicit

v_example <- c(1, 3, 7)
str(as.character(v_example))
 chr [1:3] "1" "3" "7"

Implicit

v_example <- c(1, 3)
str(v_example)
 num [1:2] 1 3
str(c(v_example, "seven"))
 chr [1:3] "1" "3" "seven"

Adapted from the tutorial of Jennifer Bryan

Reminder

vector & list

Lists

  • are vectors which are not atomic.
    • Elements can be of different types
    • The length of each element (atom) might be greater than 1.

Example

my_list <- list(1, 3, "seven")
str(my_list)
List of 3
 $ : num 1
 $ : num 3
 $ : chr "seven"
is.vector(my_list)
[1] TRUE
is.atomic(my_list)
[1] FALSE

Adapted from the tutorial of Jennifer Bryan

Reminder

functions

Declared function

my_function <- function(my_argument) {
  my_argument + 1
}
  • is defined in the global environment

    ls()
    [1] "my_function"
    ls.str()
    my_function : function (my_argument)  
  • is reusable

    my_function(2)
    [1] 3

Anonymous functions

  • Are not stored in an object and are used “on the fly”
(function(x) { x + 1 })(2)
[1] 3
  • Does not alter the global environment

    ls()
    [1] "my_function"
    # remove the previous my_function to convince you
    rm(my_function)
    (function(x) { x + 1 })(2)
    [1] 3
    ls()
    character(0)

Purrr

Purrr

functional programming

Overview

purrr enhances R’s functional programming toolkit by providing a complete and consistent set of tools for working with functions and vectors (purrr overview on github page)

functional programming is a programming paradigm – a style of building the structure and elements of computer programs – that treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data Wikipedia

Purrr

focuses on functions

functional programming features

  • pass functions as arguments to other functions
  • code that does not change state (a same call should yield the same result)

Use it for iterations

FOR EACH x DO f

Iteration

the LEGO example

Consider a hypothetic put_on function

put_on(x, antenna)

legos

antenna

lego with antenna

Iteration

the LEGO example

Illustration

legos

antenna

legos with antennas

LEGO pictures from Jennifer Bryan

for loop approach

out <- vector("list", length(legos))
for (i in seq_along(legos)) {
  out[[i]] <- put_on(legos[[i]], antenna)
}
out

functional programming approach

antennate <- function(x) put_on(x, antenna)
map(legos, antennate)

Your turn

Reminder

  • a data frame is a list

    is.list(mtcars)
    [1] TRUE
  • Each column represents an element of the list i.e. a data frame is a list of columns

Exercise

Calculate the mean of each column of the mtcars dataset.

Your turn

Using purrr::map()

map(mtcars, mean) %>%
  str()
List of 11
 $ mpg : num 20.1
 $ cyl : num 6.19
 $ disp: num 231
 $ hp  : num 147
 $ drat: num 3.6
 $ wt  : num 3.22
 $ qsec: num 17.8
 $ vs  : num 0.438
 $ am  : num 0.406
 $ gear: num 3.69
 $ carb: num 2.81

for loops

means <- vector("list", ncol(mtcars))
for (i in seq_along(mtcars)) {
  means[i] <- mean(mtcars[[i]])
}
# need to manually add names
names(means) <- names(mtcars)
means %>%
  str()
List of 11
 $ mpg : num 20.1
 $ cyl : num 6.19
 $ disp: num 231
 $ hp  : num 147
 $ drat: num 3.6
 $ wt  : num 3.22
 $ qsec: num 17.8
 $ vs  : num 0.438
 $ am  : num 0.406
 $ gear: num 3.69
 $ carb: num 2.81

For loops are fine

growing

for_loop <- function(x) {
  res <- c()
  for (i in seq_len(x)) {
    res[i] <- i
  }
}

alloc

for_loop <- function(x) {
  res <- vector(mode = "integer", 
                length = x)
  for (i in seq_len(x)) {
    res[i] <- i
  }
}

Rcpp

library(Rcpp)
cppFunction("NumericVector rcpp(int x) {
  NumericVector res(x);
  for (int i=0; i < x; i++) {
    res[i] = i;
  }
}")

Of course someone has to write loops. It doesn’t have to be you - Jenny Bryan

Iteration

the LEGO example

Illustration

legos

antenna

map(legos, antennate)

LEGO pictures from Jennifer Bryan

purrr::map() is type stable

  • applies a function to each element of a list / vector
  • returns a list
  • map(YOUR_LIST, YOUR_FUNCTION)

base::apply() vs purrr::map()

Warning

  • apply() family of function is inconsistent and some members are not type stable.
  • Exception: lapply()
  • apply() family still useful to avoid dependencies (package development)

The apply family of functions

function call input output
apply() apply(X, MARGIN, FUN, ...) array vector or array or list
lapply() lapply(X, FUN, ...) list list
sapply() sapply(X, FUN, ...) list vector or array or list
vapply() vapply(X, FUN, FUN.VALUE, ...) list specified

base::apply() vs purrr::map()

Example

apply()

apply(mtcars, 2, mean) %>% str()
 Named num [1:11] 20.09 6.19 230.72 146.69 3.6 ...
 - attr(*, "names")= chr [1:11] "mpg" "cyl" "disp" "hp" ...

lapply()

lapply(mtcars, mean) %>% str()
List of 11
 $ mpg : num 20.1
 $ cyl : num 6.19
 $ disp: num 231
 $ hp  : num 147
 $ drat: num 3.6
 $ wt  : num 3.22
 $ qsec: num 17.8
 $ vs  : num 0.438
 $ am  : num 0.406
 $ gear: num 3.69
 $ carb: num 2.81

base::apply() vs purrr::map()

Example data

Let’s create the following tibbles:

tib_1 <- tibble(a = 1:3, b = 4:6, c = 7:9)
tib_2 <- tibble(a = 1:3, b = 4:6, c = 11:13)

troublemaker function

troublemaker <- function(x) {
  if (any(x > 10)) return(NULL)
  sum(x)
}

troublemaker_chr <- function(x) {
  if (any(x > 10)) return(as.character(sum(x)))
  sum(x)
}

Define some troublemaker functions:

  • returns the sum of the input vector
  • if any number is greater than 10:
    • troublemaker() returns NULL
    • troublemaker_chr() coerces the sum to a character

base::apply() vs purrr::map()

troublemaker example

troublemaker(1:3)
[1] 6
troublemaker(11:13)
NULL
troublemaker_chr(1:3)
[1] 6
troublemaker_chr(11:13)
[1] "36"

lapply()

lapply(tib_1, troublemaker) %>% str()
List of 3
 $ a: int 6
 $ b: int 15
 $ c: int 24
lapply(tib_2, troublemaker) %>% str()
List of 3
 $ a: int 6
 $ b: int 15
 $ c: NULL
lapply(tib_2, troublemaker_chr) %>% str()
List of 3
 $ a: int 6
 $ b: int 15
 $ c: chr "36"

map()

map(tib_1, troublemaker) %>% str()
List of 3
 $ a: int 6
 $ b: int 15
 $ c: int 24
map(tib_2, troublemaker) %>% str()
List of 3
 $ a: int 6
 $ b: int 15
 $ c: NULL
map(tib_2, troublemaker_chr) %>% str()
List of 3
 $ a: int 6
 $ b: int 15
 $ c: chr "36"

base::apply() vs purrr::map()

apply()

apply(tib_1, 2, troublemaker) %>% str()
 Named int [1:3] 6 15 24
 - attr(*, "names")= chr [1:3] "a" "b" "c"
apply(tib_2, 2, troublemaker) %>% str()
List of 3
 $ a: int 6
 $ b: int 15
 $ c: NULL
apply(tib_2, 2, troublemaker_chr) %>% str()
 Named chr [1:3] "6" "15" "36"
 - attr(*, "names")= chr [1:3] "a" "b" "c"

Warning

In addition to the confusing margin argument, apply() simplifies the output value: read the help ?apply.

If each call to FUN returns a vector of length n, then apply returns an array of dimension c(n, dim(X)[MARGIN]) if n > 1. If n equals 1, apply returns a vector if MARGIN has length 1 and an array of dimension dim(X)[MARGIN] otherwise. If n is 0, the result has length 0 but not necessarily the ‘correct’ dimension.
If the calls to FUN return vectors of different lengths, apply returns a list of length prod(dim(X)[MARGIN]) with dim set to MARGIN if this has length greater than one.
In all cases the result is coerced by as.vector to one of the basic vector types before the dimensions are set, so that (for example) factor results will be coerced to a character array. value definition in ?apply

purrr::map()

map() variants

purrr provides variants coercing the output to the desired output. Generates an error if the mapped function produces an unexpected output.

map_dbl()

map_dbl(tib_1, troublemaker) %>%
  str()
 Named num [1:3] 6 15 24
 - attr(*, "names")= chr [1:3] "a" "b" "c"
map_dbl(tib_2, troublemaker) %>%
  str()
Error: Result 3 must be a single double, not NULL of length 0
map_dbl(tib_2, troublemaker_chr) %>%
  str()
Error: Can't coerce element 3 from a character to a double

base::apply() vs purrr::map()

The purrr::map() family of functions

  • are designed to be consistent
  • map() is the general function and close to base::lapply()
  • map() introduces shortcuts (absent in lapply())
  • variants to specify the type of vectorized output:
    • map_lgl()
    • map_int()
    • map_dbl()
    • map_chr()
    • map_df()

purrr::map() example

example dataset: generate a list of data frames

  • Let’s split the mtcars dataset by each value of cylinder

    spl_mtcars <- group_split(mtcars, cyl)
    str(spl_mtcars, max.level = 1)
    List of 3
     $ :Classes 'tbl_df', 'tbl' and 'data.frame':   11 obs. of  11 variables:
     $ :Classes 'tbl_df', 'tbl' and 'data.frame':   7 obs. of  11 variables:
     $ :Classes 'tbl_df', 'tbl' and 'data.frame':   14 obs. of  11 variables:
  • On each element (dataframe) of the list, we would like to:
    • fit a linear model (miles per gallon explained by the weight)
    • extract the \(r^2\) value

From R for Data Science

group_split() comes in the recent dplyr version 0.8

Reminder

Fit a linear model

  • Using the mtcars dataset we can fit a linear model to explain the miles per gallon (mpg) by the weight (wt) using:
lm(mpg ~ wt, data = mtcars)
Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344  

Reminder, lm outputs complex objects

Summarise a linear model

base::summary() summarises the model:

lm(mpg ~ wt, data = mtcars) %>%
  summary()
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Reminder

base::summary() generates a list

lm(mpg ~ wt, data = mtcars) %>%
  summary() %>%
  str(max.level = 1, give.attr = FALSE)
List of 11
 $ call         : language lm(formula = mpg ~ wt, data = mtcars)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ wt
 $ residuals    : Named num [1:32] -2.28 -0.92 -2.09 1.3 -0.2 ...
 $ coefficients : num [1:2, 1:4] 37.285 -5.344 1.878 0.559 19.858 ...
 $ aliased      : Named logi [1:2] FALSE FALSE
 $ sigma        : num 3.05
 $ df           : int [1:3] 2 30 2
 $ r.squared    : num 0.753
 $ adj.r.squared: num 0.745
 $ fstatistic   : Named num [1:3] 91.4 1 30
 $ cov.unscaled : num [1:2, 1:2] 0.38 -0.1084 -0.1084 0.0337

Extract \(r^2\)

fit_summary <- summary(lm(mpg ~ wt, data = mtcars))
fit_summary$r.squared
[1] 0.7528328

purrr::map() example

1 - map the linear model

  • map(YOUR_LIST, YOUR_FUNCTION)
  • YOUR_LIST = spl_mtcars
  • YOUR_FUNCTION can be an anonymous function (declared on the fly)
map(spl_mtcars, function(x) lm(mpg ~ wt, data = x))
[[1]]

Call:
lm(formula = mpg ~ wt, data = x)

Coefficients:
(Intercept)           wt  
     39.571       -5.647  


[[2]]

Call:
lm(formula = mpg ~ wt, data = x)

Coefficients:
(Intercept)           wt  
      28.41        -2.78  


[[3]]

Call:
lm(formula = mpg ~ wt, data = x)

Coefficients:
(Intercept)           wt  
     23.868       -2.192  

purrr::map() example

map(spl_mtcars,
    function(x) lm(mpg ~ wt, data = x))
[[1]]

Call:
lm(formula = mpg ~ wt, data = x)

Coefficients:
(Intercept)           wt  
     39.571       -5.647  


[[2]]

Call:
lm(formula = mpg ~ wt, data = x)

Coefficients:
(Intercept)           wt  
      28.41        -2.78  


[[3]]

Call:
lm(formula = mpg ~ wt, data = x)

Coefficients:
(Intercept)           wt  
     23.868       -2.192  

purrr::map() example

2 - extract \(r^2\)

  • map uses a list as an argument and returns a list
  • we can map a new function on the output of the previous call
  • works well in a pipeline (using %>%)
spl_mtcars %>% 
  map(function(x) lm(mpg ~ wt, data = x)) %>%
  map(summary) %>%
  map(function(x) x$r.squared)
[[1]]
[1] 0.5086326

[[2]]
[1] 0.4645102

[[3]]
[1] 0.4229655

Tip

The code above can be simplified using shortcuts provided by purrr

purrr::map() shortcuts

Anonymous functions

  • One sided formula create anonymous functions:
    • define the function using ~
    • use the placeholder .x to refer to the current list element (.x represents the argument of the anonymous function)
    map(YOUR_LIST, function(x) lm(mpg ~ wt, data = x))
    
    # is equivalent to:
    
    map(YOUR_LIST, ~ lm(mpg ~ wt, data = .x))

Initial code

spl_mtcars %>% 
  map(function(x) lm(mpg ~ wt, data = x)) %>%
  map(summary) %>%
  map(function(x) x$r.squared)

With anon. function shortcuts

spl_mtcars %>% 
  map(~lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map(~.x$r.squared)

purrr::map() shortcuts

Named components

  • Use a string to extract named components

    map(spl_mtcars, "mpg")
    [[1]]
     [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4
    
    [[2]]
    [1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7
    
    [[3]]
     [1] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 13.3 19.2 15.8 15.0
  • Use a number to extract by index

    map(spl_mtcars, 1)
    [[1]]
     [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4
    
    [[2]]
    [1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7
    
    [[3]]
     [1] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 13.3 19.2 15.8 15.0

purrr::map() shortcuts

Initial code

spl_mtcars %>% 
  map(function(x) lm(mpg ~ wt, data = x)) %>%
  map(summary) %>%
  map(function(x) x$r.squared)
[[1]]
[1] 0.5086326

[[2]]
[1] 0.4645102

[[3]]
[1] 0.4229655

Using shortcuts 1

spl_mtcars %>% 
  map(~lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map(~.x$r.squared)
[[1]]
[1] 0.5086326

[[2]]
[1] 0.4645102

[[3]]
[1] 0.4229655

Using shortcuts 2

spl_mtcars %>% 
  map(~lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map("r.squared")
[[1]]
[1] 0.5086326

[[2]]
[1] 0.4645102

[[3]]
[1] 0.4229655

purrr::map_*() variants

using map()

spl_mtcars %>% 
  map(~lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map("r.squared") %>% 
  str()
List of 3
 $ : num 0.509
 $ : num 0.465
 $ : num 0.423

using map_dbl()

spl_mtcars %>% 
  map(~lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map_dbl("r.squared") %>% 
  str()
 num [1:3] 0.509 0.465 0.423

purrr::map() and mutliple arguments

Additional function arguments

You can use purrr::map() shortcuts to pass additional (constant) arguments

Initial example

antennate <- function(x) put_on(x, antenna)
map(legos, antennate)

Using anonymous function

map(legos, ~ put_on(.x, antenna))

Shortcut for additional arguments

map(legos, put_on, antenna)

Iterating on 2 lists

map() for a single list

map(legos, antennate)

map2() for two lists

enhair <- function(x, y) put_on(x, y)
map2(legos, hairs, enhair)

Iterating on several lists

pmap(.l, .f, ...)

  • Supply the different lists as a list of list (.l)
  • Placeholder in the anonymous function is: ..1, ..2, ..3, …

pmap() example

my_list <- list(
  list(1, 5, 7),
  list(6, 10, 9),
  list(1, 2, 0.5)
     )

pmap(my_list,
     function(a, b, c) {seq(from = a, to = b, by = c)}
     ) %>%
  str()
List of 3
 $ : num [1:6] 1 2 3 4 5 6
 $ : num [1:3] 5 7 9
 $ : num [1:5] 7 7.5 8 8.5 9
pmap(my_list, ~seq(from = ..1, to = ..2, by = ..3)) %>%
  str()
List of 3
 $ : num [1:6] 1 2 3 4 5 6
 $ : num [1:3] 5 7 9
 $ : num [1:5] 7 7.5 8 8.5 9

Mapping a function for its side-effects

side-effects

A function might be called for its side-effects:

  • output on screen
  • save files to disk

Use the walk family of function

  • walk(), walk2(), pwalk()
  • returns the input list
  • can be used in a pipeline (%>%)

The tidyverse focuses on data frames

pros

  • easy
  • works with dplyr
  • nicely structured

cons

  • same length requirement
  • atomic vectors

solution

  • Use lists! But inside the tidyverse
    • tibble introduces list-columns
    • works with dplyr
    • groups are respected
    • easily created with group_nest()
    • perfect input for purrr::map()

Lists as a column in a tibble

Example

tibble(numbers = 1:8,
       my_list = list(a = c("a", "b"), b = 2.56, 
                      c = c("a", "b"), d = rep(TRUE, 4),
                      d = 2:3, e = 4:6, f = "Z", g = 1:4))
# A tibble: 8 x 2
  numbers my_list  
    <int> <list>   
1       1 <chr [2]>
2       2 <dbl [1]>
3       3 <chr [2]>
4       4 <lgl [4]>
5       5 <int [2]>
6       6 <int [3]>
7       7 <chr [1]>
8       8 <int [4]>

Example

Rewriting our previous example

Nesting the tibble by cylinder

mtcars %>%
  group_nest(cyl)
# A tibble: 3 x 2
    cyl data              
  <dbl> <list>            
1     4 <tibble [11 × 10]>
2     6 <tibble [7 × 10]> 
3     8 <tibble [14 × 10]>

Example

Rewriting our previous example

Use mutate and map

mtcars %>%
  group_nest(cyl) %>%
  mutate(model = map(data, ~lm(mpg ~ wt, data = .x)),
         summary = map(model, summary),
         r_squared = map_dbl(summary, "r.squared"))
# A tibble: 3 x 5
    cyl data               model    summary          r_squared
  <dbl> <list>             <list>   <list>               <dbl>
1     4 <tibble [11 × 10]> <S3: lm> <S3: summary.lm>     0.509
2     6 <tibble [7 × 10]>  <S3: lm> <S3: summary.lm>     0.465
3     8 <tibble [14 × 10]> <S3: lm> <S3: summary.lm>     0.423
  • very powerful
  • data rectangle
  • next lecture will show you how dplyr, tidyr, tibble, purrr and broom nicely work together

Purrr

Pure functions

A function is called “pure” if all its inputs are declared as inputs - none of them are hidden - and likewise all its outputs are declared as outputs Kris Jenkins

“Purify” functions

Pure functions

Impure function

  • has hidden inputs or outputs
  • hidden inputs and/or outputs are called side-effects
start <- 10

impure <- function(x) {
  print(start)
  x + start
}

result <- impure(2)
[1] 10
result
[1] 12

Pure function

  • no hidden inputs and/or outputs
pure <- function(x, start) {
  x + start
}

result <- pure(2, start)
result
[1] 12

“Purify” functions

Example

  • Even log() has side-effects
  • Use purrr::safely() to catch every output

Impure log()

(res <- log(10))
[1] 2.302585
res <- log("a")
Error in log("a"): non-numeric argument to mathematical function
res
[1] 2.302585

Purified log()

safe_log <- purrr::safely(log)
(res <- safe_log(10))
$result
[1] 2.302585

$error
NULL
res <- safe_log("a")
res
$result
NULL

$error
<simpleError in .Primitive("log")(x, base): non-numeric argument to mathematical function>

Many more verbs

Wrap up

map() or walk()

map2() or walk2()

pmap()

Pictures from Lise Vaudor’s blog

Don’t forget vectorisation

Warning

Don’t overmap functions!
Use map only if required (non vectorised function)

example

nums <- sample(1:10,
               size = 1000,
               replace = TRUE) 

log_vec <- log(nums)
log_map <- map_dbl(nums, log)

identical(log_vec, log_map)
[1] TRUE

benchmark

bench::mark(
  vectorised = log(nums),
  mapped = map_dbl(nums, log)) %>%
  autoplot()

Before we stop

Acknowledgments

  • Jennifer Bryan (LEGO pictures, courtesy CC licence)
  • Hadley Wickham
  • David Robinson
  • Eric Koncina (iosp R package for slides)
  • Ian Lyttle
  • Robert Rudis

Further reading

Slides online